Data Source: https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2020

This dataset contains data related to Formula 1 races. It includes variables describing the duration of a pit stop, that is, the time a car spends in the box changing its wheels; the driver of the car, the number of stops the car has already made in the race, and the lap on which the stop took place.

In this study we are going to analyze the time a car loses when stopping to change its wheels, i.e., our target is to predict the duration of a pit stop in milliseconds.

First of all we load the data and remove redundant columns.

rm(list = ls())
setwd("C:/Users/alejo/OneDrive - Universidad Carlos III de Madrid/UC3M/Año 3/Segundo Cuatri/Bayesian/project/data")


data  <- read.csv("pit_stops.csv")
drivers  <- read.csv("drivers.csv")
races  <- read.csv("races.csv")
circuits = read.csv("circuits.csv")

data = data[, -6] #remove time in seconds. 
data = data[, -5] #remove time in hours 

library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(MCMCpack)
## Warning: package 'MCMCpack' was built under R version 4.2.3
## Loading required package: coda
## Warning: package 'coda' was built under R version 4.2.3
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked _by_ '.GlobalEnv':
## 
##     drivers
## ##
## ## Markov Chain Monte Carlo Package (MCMCpack)
## ## Copyright (C) 2003-2023 Andrew D. Martin, Kevin M. Quinn, and Jong Hee Park
## ##
## ## Support provided by the U.S. National Science Foundation
## ## (Grants SES-0350646 and SES-0350613)
## ##
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
summary(data$milliseconds)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12897   21908   23557   72402   26187 3069017
ggplot(data=data) + aes(y = milliseconds) + geom_boxplot(fill = "darkgreen") + ggtitle("Pit Stop Time (ms)")

Here we can see that we have some pit stops of around 300 seconds. A normal pit stop should not last longer than 30-40 seconds, so these are quite unusual. We draw the boxplot to check for outliers, and I have decided to remove the pit stops that last longer than 40 seconds, as those are the ones where the car has suffered some kind of accident and many parts of the car had to be replaced.

We are going to focus on analyzing only the stops where the car just changes its wheels, as we have no data describing the condition in which the car arrives at the box.

data = data[!(data$milliseconds >= 40000),] #removing rows with pit_stops that last more than 40 secs. 

ggplot(data=data) + aes(y = milliseconds) + geom_boxplot(fill = "darkgreen") + ggtitle("Pit Stop Time (ms)")

# We could decrease the stop-time threshold even more; let's check: 
ggplot(data=data) + aes(y = milliseconds) + ylim(0,35000)+  geom_boxplot(fill = "darkgreen") + ggtitle("Pit Stop Time (ms)")
## Warning: Removed 165 rows containing non-finite values (`stat_boxplot()`).

summary(data$milliseconds)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12897   21831   23376   24142   25471   39980

We have removed all the pit stops longer than 40 seconds, which are the ones where some problem with the car occurred that we cannot control for.

As our target is to predict the time of a pit stop and we have around 200 different races, it is important to replace the race id with the circuit on which the race was run: races on the same circuit should have similar pit stop times, since the shape of the track is the same.

Now I am going to add to my data, from the external datasets, the circuitRef and the driverRef so that we can use them as categorical variables.

old_data = data

#Let's join the races with the circuitIds and merge the result with our data: 
races = races[, c("raceId", "circuitId")]
data <- merge(data, races, by = "raceId")

data = data[, -1] #remove raceId as it is no longer of interest to us. 

# I am also interested in having the circuits as a categorical variable: 
circuits = circuits[, c("circuitId", "circuitRef")]
data = merge(data, circuits, by = "circuitId")

# I am also interested in having the drivers as a categorical variable: 
drivers = drivers[, c("driverId", "driverRef")]
data = merge(data, drivers, by = "driverId")


data$circuitRef = as.factor(data$circuitRef)
data$driverRef = as.factor(data$driverRef)

# Removing the numerical id variables now that we have the categorical ones we are going to use: 
data = data[, -c(1, 2)] #drop driverId and circuitId

#Now that the data is cleaner, we are going to try to understand how each NUMERICAL variable is related to our target:

attach(data)

hist(stop,col="darkgreen"); hist(lap,col="darkgreen"); hist(milliseconds,col="darkgreen");hist(log(milliseconds),col="darkgreen"); 

summary(milliseconds)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12897   21831   23376   24142   25471   39980
sd(milliseconds)
## [1] 3890.75

Checking these plots we can see that the number of stops in a race is usually 1 or 2, and it is rare to have more. Regarding the number of laps, races tend to have around 40-60 laps, and the pit stops are mostly made between laps 10 and 40.

We are not going to use the variables stop and lap, as they are not related at all to the time a car takes in a stop; we will confirm that now with a correlation plot.

Before that, note that even after removing many outliers, the variable we want to estimate, milliseconds, still has long tails on both sides, which will make it harder to fit a distribution.

Now, we are checking for correlation between the variables:

ggcorr(data, label=T)
## Warning in ggcorr(data, label = T): data in column(s) 'circuitRef', 'driverRef'
## are not numeric and were ignored

The only correlation is between stop and lap. This is not interesting for us: the further a race progresses, the more laps have been run and the more stops a car will have made.

Now we are going to look at two very interesting plots for the categorical variables, which let us see how relevant they are:

top_circuits <- names(sort(table(data$circuitRef), decreasing = TRUE))[1:15]
data_filtered <- subset(data, circuitRef %in% top_circuits)
ggplot(data_filtered, aes(fill = circuitRef, y = milliseconds)) + geom_boxplot() + labs(title= "Pit-Stop time of Top 15 Circuits")

Depending on the circuit, the distribution of the time in a pit-stop changes a lot. This is the variable we are going to use to predict the time of a pit-stop as depending on the shape of the circuit we could have a different expected time for a pit-stop.

# Filter the drivers to only include the top 15, those with more recorded stops. 
top_drivers <- names(sort(table(data$driverRef), decreasing = TRUE))[1:15]
data_filtered <- subset(data, driverRef %in% top_drivers)

ggplot(data_filtered, aes(fill = driverRef, y = milliseconds)) + geom_boxplot() + labs(title= "Pit-Stop time of Top 15 Drivers")

Regarding the drivers, the boxplots show that they all follow more or less the same distribution of pit-stop times. They will not be significant for predicting the milliseconds, but we are going to check this with a density plot too.

p = ggplot(data)+aes(x=milliseconds, fill = circuitRef) + geom_density(alpha = 0.5)  + ggtitle("Pit-Stops(ms) depending on the circuit"); ggplotly(p)

Here we confirm our hypothesis: each circuit has a different time distribution, since the track layout differs and the time for a pit stop changes with it.

p = ggplot(data_filtered)+aes(x=milliseconds, fill = driverRef) + geom_density(alpha = 0.5)  + ggtitle("Pit-Stops(ms) depending on the driver"); ggplotly(p)

Looking at this graph we can state that the driver is not useful for predicting the pit-stop time on a circuit, as all the drivers follow the same distribution. But we can also see that the distribution has two main maxima, which suggests it is bimodal.
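One quick, hedged way to check the "two main peaks" guess is to count the local maxima of a kernel density estimate. The snippet below uses a toy bimodal sample (simulated, not the F1 data); on the real data the same idea would be applied to density(data$milliseconds).

```r
# Count local maxima of a kernel density estimate (toy bimodal sample)
set.seed(5)
x <- c(rnorm(500, 22000, 1000), rnorm(500, 27000, 1000))
d <- density(x)
# A local maximum is where the slope changes from positive to negative
n_peaks <- sum(diff(sign(diff(d$y))) == -2)
n_peaks
```

A value of 2 supports bimodality; note that small bandwidths can create spurious peaks, so the result depends on the smoothing used by density().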

Defining a Benchmark

First of all, for each circuit we compute the mean of all its observations and check the results. Any model that performs worse than this will be considered not useful at all.

results_benchmark <- aggregate(milliseconds ~ circuitRef, data = data, FUN = mean)
aux = data[, c("circuitRef", "milliseconds")]

results_benchmark <- merge(aux, results_benchmark, by = "circuitRef")
ggplot(results_benchmark, aes(x = milliseconds.x, y = ..density..)) +
  geom_histogram(fill = "darkgreen", bins = 40) +
  geom_density(aes(x = milliseconds.y), color = "black", size = 1) +
  labs(x = "Milliseconds", y = "Density", title = "Benchmark averaging the value for each circuit") +
  theme_bw()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.

This is the simplest result we can get: computing the mean of each circuit and using that mean as the prediction.
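To make the benchmark comparable with the models below, we can also attach an R^2 to it. The sketch below does this on toy data (hypothetical groups and values, not the F1 set); the same pattern would apply with circuitRef and milliseconds.

```r
# Sketch: R^2 of a per-group-mean benchmark, on toy data
set.seed(1)
toy <- data.frame(group = rep(c("a", "b", "c"), each = 50),
                  y = rnorm(150, mean = rep(c(20, 25, 30), each = 50)))
group_means <- aggregate(y ~ group, data = toy, FUN = mean)
# Predict every observation with its group's mean
pred <- group_means$y[match(toy$group, group_means$group)]
benchmark_r2 <- cor(toy$y, pred)^2  # share of variance the group means explain
benchmark_r2
```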

Frequentist approach

We compute R^2 when predicting the time from circuitRef, assuming that the data follow a Gaussian distribution:

# Get unique circuit references from the dataset
circuit = unique(data$circuitRef)
# Get fit model
fit = glm(milliseconds~circuitRef, data=data, family="gaussian")
summary(fit)
## 
## Call:
## glm(formula = milliseconds ~ circuitRef, family = "gaussian", 
##     data = data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -10049.7   -1195.8    -276.6     685.4   17673.3  
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               23202.0      164.4 141.154  < 2e-16 ***
## circuitRefamericas         1887.0      226.6   8.327  < 2e-16 ***
## circuitRefbahrain          1835.2      203.1   9.034  < 2e-16 ***
## circuitRefbaku            -2263.4      257.7  -8.782  < 2e-16 ***
## circuitRefbuddh             440.3      309.9   1.421 0.155441    
## circuitRefcatalunya       -1012.0      202.4  -5.001 5.81e-07 ***
## circuitRefhockenheimring  -2645.5      234.1 -11.303  < 2e-16 ***
## circuitRefhungaroring      -830.3      202.9  -4.092 4.31e-05 ***
## circuitRefimola            7974.8      359.6  22.180  < 2e-16 ***
## circuitRefinterlagos       -235.5      203.5  -1.157 0.247134    
## circuitRefistanbul          126.8      293.2   0.432 0.665425    
## circuitRefjeddah           -793.3      536.3  -1.479 0.139126    
## circuitReflosail           3216.1      544.3   5.908 3.58e-09 ***
## circuitRefmarina_bay       6840.2      209.2  32.691  < 2e-16 ***
## circuitRefmiami           -3088.8      591.7  -5.220 1.83e-07 ***
## circuitRefmonaco           3152.8      227.6  13.853  < 2e-16 ***
## circuitRefmonza            1855.6      229.8   8.073 7.70e-16 ***
## circuitRefmugello          -917.1      473.3  -1.937 0.052730 .  
## circuitRefnurburgring     -1388.0      293.2  -4.733 2.24e-06 ***
## circuitRefportimao         1984.3      349.4   5.680 1.39e-08 ***
## circuitRefred_bull_ring   -1160.4      223.4  -5.194 2.10e-07 ***
## circuitRefricard           8125.7      347.8  23.364  < 2e-16 ***
## circuitRefrodriguez         226.1      263.3   0.859 0.390467    
## circuitRefsepang           1966.7      221.3   8.887  < 2e-16 ***
## circuitRefshanghai         -784.4      211.0  -3.718 0.000202 ***
## circuitRefsilverstone      5184.3      212.8  24.359  < 2e-16 ***
## circuitRefsochi            7534.6      264.1  28.527  < 2e-16 ***
## circuitRefspa               111.7      216.9   0.515 0.606804    
## circuitRefsuzuka            519.3      217.6   2.386 0.017055 *  
## circuitRefvalencia        -1232.5      312.9  -3.940 8.22e-05 ***
## circuitRefvilleneuve        585.6      225.8   2.593 0.009519 ** 
## circuitRefyas_marina      -1018.4      220.0  -4.629 3.72e-06 ***
## circuitRefyeongam          -910.0      292.5  -3.111 0.001871 ** 
## circuitRefzandvoort       -3703.5      328.3 -11.280  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 8078615)
## 
##     Null deviance: 1.3877e+11  on 9167  degrees of freedom
## Residual deviance: 7.3790e+10  on 9134  degrees of freedom
## AIC: 171868
## 
## Number of Fisher Scoring iterations: 2
# Initialize vector to store predictions
frequentist_means <- rep(NA, length(circuit))

# Make a prediction for each circuit
for (i in 1:length(circuit)) {
  frequentist_means[i] <- predict(fit, newdata = data.frame(circuitRef = circuit[i]), type = "response")
}

# Combine results into a data frame
results <- data.frame(circuitRef = circuit, frequentist_time = frequentist_means)
aux = data[, c("circuitRef", "milliseconds")]

results <- merge(aux, results, by = "circuitRef")


MSE = (1 / length(results) *  sum(((results$milliseconds - results$frequentist_time)/mean(results$milliseconds))^2)); MSE
## [1] 42.20223
R_squared = cor(results$milliseconds, results$frequentist_time)^2; R_squared 
## [1] 0.4682543
# The correlation squared. 
# It measures how well the model explains the variability of the milliseconds. 

In the model summary we can see that most of the coefficients are significantly different from zero.
Using the categorical circuit name, the model explains about 46% of the variability of the pit stops, which seems a fairly good result.

Notice that this is not exactly the MSE: the residuals are divided by the mean time, and length(results) counts the columns of the data frame rather than its rows. I will not trust this "MSE" number; it is used only for comparison.
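For reference, the conventional metrics would divide by the number of rows (nrow), since length() on a data frame returns the number of columns. A minimal sketch on made-up values:

```r
# Conventional MSE / RMSE / R^2, dividing by the number of observations
y_true <- c(21000, 23000, 25000, 27000)  # illustrative values only
y_pred <- c(22000, 22500, 25500, 26000)
mse  <- mean((y_true - y_pred)^2)        # mean() divides by length(y_true)
rmse <- sqrt(mse)
r2   <- 1 - sum((y_true - y_pred)^2) / sum((y_true - mean(y_true))^2)
c(MSE = mse, RMSE = rmse, R2 = r2)       # MSE = 625000, R2 = 0.875
```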

Just for comparison, we fit the glm model with families "gaussian", "Gamma" and "inverse.gaussian" and compare the results in a table.

families = c("gaussian", "Gamma", "inverse.gaussian")

# Get unique circuit references from the dataset
circuit = unique(data$circuitRef)

Rs_squared = AICs = MSEs = Deviances =  rep(NA, length(families))

# Get fit model
for (j in 1:length(families)) {

  fit = glm(milliseconds~circuitRef, data=data, family=families[j])
  # Initialize vector to store predictions
  frequentist_means <- rep(NA, length(circuit))
  
  # Make a prediction for each circuit
  for (i in 1:length(circuit)) {
    frequentist_means[i] <- predict(fit, newdata = data.frame(circuitRef = circuit[i]), type = "response")
  }
  
  # Combine results into a data frame
  results <- data.frame(circuitRef = circuit, frequentist_time = frequentist_means)
  aux = data[, c("circuitRef", "milliseconds")]
  
  results <- merge(aux, results, by = "circuitRef")
  
  MSEs[j] = (1 / length(results) *  sum((results$milliseconds - results$frequentist_time)^2));
  
  Rs_squared[j] = cor(results$milliseconds, results$frequentist_time)^2 # The correlation squared:
# measures how well the model explains the variability of the milliseconds.
  AICs[j] = fit$aic
  Deviances[j] = fit$null.deviance
}

models <- data.frame(Family = families, R_Squared = Rs_squared, AIC =AICs, MSE = MSEs, Deviance = Deviances); models
##             Family R_Squared      AIC         MSE     Deviance
## 1         gaussian 0.4682543 171868.2 24596689450 1.387695e+11
## 2            Gamma 0.4682543 171273.6 24596689450 2.255710e+02
## 3 inverse.gaussian 0.4682543 171281.3 24596689450 9.309282e-03

A Gaussian model may be adequate, but the AIC for the Gamma model is a bit lower. This seems reasonable, as durations are in general well modeled by a Gamma distribution.
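A toy check of this claim, on simulated data rather than the F1 set: when the response is genuinely right-skewed "time" data, a Gamma GLM tends to attain a lower AIC than a Gaussian one.

```r
# Simulate skewed per-group durations and compare GLM AICs
set.seed(42)
g <- factor(rep(c("c1", "c2", "c3"), each = 200))
mu <- rep(c(22000, 24000, 26000), each = 200)        # group means
y <- rgamma(600, shape = 4, rate = 4 / mu)           # right-skewed times
fit_gauss <- glm(y ~ g, family = "gaussian")
fit_gamma <- glm(y ~ g, family = "Gamma")
c(AIC_gaussian = AIC(fit_gauss), AIC_gamma = AIC(fit_gamma))
```

Here the Gamma fit wins because the data-generating process is Gamma; on the real pit-stop data the gap was small, consistent with the near-symmetric shape after outlier removal.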

At the end of this document we will fit a Gamma model using a Bayesian approach.

Frequentist approach getting R^2 with driverRef

We do not expect any improvement in these results, but we do it as a check.

# Get unique driver IDs from the dataset
driver = unique(data$driverRef)
# Get fit model
fit = glm(milliseconds~driverRef, data=data, family="Gamma")

# Initialize vector to store predictions
frequentist_means <- rep(NA, length(driver))

# Fit GLM and make predictions for each driver
for (i in 1:length(driver)) {
  frequentist_means[i] <- predict(fit, newdata = data.frame(driverRef = driver[i]), type = "response")
}

# Combine results into a data frame
results <- data.frame(driverRef = driver, frequentist_time = frequentist_means)
aux = data[, c("driverRef", "milliseconds")]

results <- merge(aux, results, by = "driverRef")

#MSE = (1/n) * sum((y_true - y_pred)^2)
#R-squared = 1 - (sum((y_true - y_pred)^2) / sum((y_true - mean(y_true))^2))

MSE = (1 / length(results) *  sum(((results$milliseconds - results$frequentist_time)/mean(results$milliseconds))^2)); MSE
## [1] 76.89451
R_squared = cor(results$milliseconds, results$frequentist_time)^2; R_squared # The correlation squared. 
## [1] 0.0311336
# It measures how well the model explains the variability of the milliseconds. 

remove(aux)

We can see that here the R^2 is very low, as expected: the pit-stop time does not depend on the driver the way it does on the circuit where it happens. So we will not use the drivers to predict pit-stop times.

Frequentist approach getting R^2 with circuitRef, now also including the driver in the model.

Just to confirm it again, we look at the stop times predicted for a fixed driver.

# Get unique circuit references from the dataset
circuit = unique(data$circuitRef)
# Get fit model
fit = glm(milliseconds~driverRef+circuitRef, data=data, family="Gamma")

# Initialize vector to store predictions
frequentist_means <- rep(NA, length(circuit))

# Make a prediction for each circuit (for a fixed driver)
for (i in 1:length(circuit)) {
  frequentist_means[i] <- predict(fit, newdata = data.frame(driverRef = "alonso", circuitRef = circuit[i]), type = "response")
}

# Combine results into a data frame
results <- data.frame(circuitRef = circuit, frequentist_time = frequentist_means)
aux = data[, c("circuitRef", "milliseconds")]

results <- merge(aux, results, by = "circuitRef")

MSE = (1 / length(results) *  sum(((results$milliseconds - results$frequentist_time)/mean(results$milliseconds))^2)); MSE
## [1] 44.74101
R_squared = cor(results$milliseconds, results$frequentist_time)^2; R_squared # The correlation squared. 
## [1] 0.4674421
# It measures how well the model explains the variability of the milliseconds. 

R^2 is more or less the same as before, and the results barely change.

ggplot(results, aes(x = milliseconds, y = ..density..)) +
  geom_histogram(fill = "darkgreen", bins = 40) +
  geom_density(aes(x = frequentist_time), color = "black", size = 1) +
  labs(x = "Milliseconds", y = "Density", title = "Histogram with Frequentist Prediction for a specific driver") +
  theme_bw()

We can see that the driver is not relevant for the pit-stop time: results don't change with or without it. This 46% is the same as before, and the histogram is also quite similar.

Up to now we have made many attempts, but we are not getting better results than the benchmark, as the graphs themselves show.

Frequentist approach with logarithms.

If we apply the log to the milliseconds data, these are the results we get:

data$milliseconds = log(data$milliseconds)
# Get unique circuit references from the dataset
circuit = unique(data$circuitRef)
# Get fit model
fit = glm(milliseconds~circuitRef, data=data, family="Gamma")

# Initialize vector to store predictions
frequentist_means <- rep(NA, length(circuit))

# Make a prediction for each circuit
for (i in 1:length(circuit)) {
  frequentist_means[i] <- predict(fit, newdata = data.frame(circuitRef = circuit[i]), type = "response")
}

# Combine results into a data frame
results <- data.frame(circuitRef = circuit, frequentist_time = frequentist_means)
aux = data[, c("circuitRef", "milliseconds")]

results <- merge(aux, results, by = "circuitRef")

results$milliseconds = exp(results$milliseconds)
results$frequentist_time = exp(results$frequentist_time)

MSE = (1 / length(results) *  sum(((results$milliseconds - results$frequentist_time)/mean(results$milliseconds))^2)); MSE
## [1] 42.34431
R_squared = cor(results$milliseconds, results$frequentist_time)^2; R_squared # The correlation squared. 
## [1] 0.4680891
# It measures how well the model explains the variability of the milliseconds. 

data$milliseconds = exp(data$milliseconds)
ggplot(results, aes(x = milliseconds, y = ..density..)) +
  geom_histogram(fill = "darkgreen", bins = 40) +
  geom_density(aes(x = frequentist_time), color = "black", size = 1) +
  labs(x = "Milliseconds", y = "Density", title = "Frequentist Prediction applying logarithms") +
  theme_bw()

So, mostly the same as before. I applied logarithms to the original data, fit the model, and exponentiated before plotting. Let's move to a Bayesian approach and hope for better results.

Bayesian Approach

First of all, let's fit a good Bayesian model:

bayes.model <- MCMCregress(milliseconds ~ circuitRef, data=data, burnin = 1000, mcmc = 10000) # MCMCregress's iteration argument is mcmc, not nitt
summary(bayes.model)
## 
## Iterations = 1001:11000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 10000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                               Mean       SD Naive SE Time-series SE
## (Intercept)                23200.4    163.5    1.635          1.635
## circuitRefamericas          1889.7    226.2    2.262          2.308
## circuitRefbahrain           1836.0    202.8    2.028          2.060
## circuitRefbaku             -2263.7    255.8    2.558          2.558
## circuitRefbuddh              442.6    309.9    3.099          3.099
## circuitRefcatalunya        -1009.6    200.9    2.009          2.009
## circuitRefhockenheimring   -2641.9    230.3    2.303          2.303
## circuitRefhungaroring       -828.0    201.7    2.017          2.017
## circuitRefimola             7977.8    357.6    3.576          3.576
## circuitRefinterlagos        -233.2    205.4    2.054          2.054
## circuitRefistanbul           127.2    292.9    2.929          2.929
## circuitRefjeddah            -788.6    527.3    5.273          5.273
## circuitReflosail            3209.2    539.9    5.399          5.399
## circuitRefmarina_bay        6840.4    208.0    2.080          2.080
## circuitRefmiami            -3087.8    588.1    5.881          6.021
## circuitRefmonaco            3154.9    226.5    2.265          2.265
## circuitRefmonza             1859.5    230.8    2.308          2.308
## circuitRefmugello           -912.3    470.7    4.707          4.707
## circuitRefnurburgring      -1386.2    289.1    2.891          2.941
## circuitRefportimao          1984.3    352.1    3.521          3.521
## circuitRefred_bull_ring    -1160.4    224.7    2.247          2.247
## circuitRefricard            8129.4    348.8    3.488          3.488
## circuitRefrodriguez          228.4    261.6    2.616          2.616
## circuitRefsepang            1969.4    222.3    2.223          2.223
## circuitRefshanghai          -783.4    211.6    2.116          2.116
## circuitRefsilverstone       5186.2    211.9    2.119          2.202
## circuitRefsochi             7534.6    265.7    2.657          2.705
## circuitRefspa                111.6    217.1    2.171          2.171
## circuitRefsuzuka             521.1    216.9    2.169          2.169
## circuitRefvalencia         -1230.3    314.7    3.147          3.147
## circuitRefvilleneuve         588.4    224.9    2.249          2.249
## circuitRefyas_marina       -1014.5    220.2    2.202          2.202
## circuitRefyeongam           -907.7    294.1    2.941          2.941
## circuitRefzandvoort        -3699.8    326.5    3.265          3.202
## sigma2                   8081914.5 121694.8 1216.948       1216.948
## 
## 2. Quantiles for each variable:
## 
##                                2.5%        25%       50%        75%      97.5%
## (Intercept)                22874.45   23090.31   23201.7   23309.54  2.352e+04
## circuitRefamericas          1449.57    1737.43    1888.5    2039.93  2.337e+03
## circuitRefbahrain           1447.75    1700.15    1833.1    1970.27  2.236e+03
## circuitRefbaku             -2756.00   -2436.69   -2265.9   -2093.89 -1.749e+03
## circuitRefbuddh             -164.03     232.01     439.6     652.60  1.054e+03
## circuitRefcatalunya        -1403.25   -1142.86   -1008.4    -873.74 -6.184e+02
## circuitRefhockenheimring   -3090.48   -2798.32   -2644.4   -2485.84 -2.189e+03
## circuitRefhungaroring      -1221.89    -964.62    -828.2    -691.74 -4.288e+02
## circuitRefimola             7271.92    7739.95    7975.2    8223.98  8.661e+03
## circuitRefinterlagos        -631.54    -369.71    -235.2     -94.49  1.704e+02
## circuitRefistanbul          -449.06     -72.78     128.1     321.69  7.050e+02
## circuitRefjeddah           -1809.97   -1153.94    -791.6    -427.24  2.375e+02
## circuitReflosail            2152.33    2842.49    3210.9    3574.31  4.276e+03
## circuitRefmarina_bay        6437.62    6701.35    6839.9    6976.98  7.251e+03
## circuitRefmiami            -4257.95   -3477.51   -3095.6   -2688.27 -1.938e+03
## circuitRefmonaco            2717.63    3002.03    3155.2    3306.17  3.602e+03
## circuitRefmonza             1409.00    1702.22    1859.3    2015.59  2.309e+03
## circuitRefmugello          -1848.25   -1231.99    -907.1    -596.37  3.659e+00
## circuitRefnurburgring      -1942.60   -1585.80   -1388.2   -1188.76 -8.224e+02
## circuitRefportimao          1301.10    1743.77    1983.1    2225.60  2.670e+03
## circuitRefred_bull_ring    -1599.31   -1315.04   -1158.5   -1011.66 -7.124e+02
## circuitRefricard            7446.37    7891.71    8127.0    8366.60  8.815e+03
## circuitRefrodriguez         -277.96      50.53     225.1     406.54  7.400e+02
## circuitRefsepang            1529.93    1820.73    1970.7    2118.40  2.410e+03
## circuitRefshanghai         -1197.24    -923.93    -783.6    -643.50 -3.697e+02
## circuitRefsilverstone       4773.72    5042.91    5184.9    5330.04  5.603e+03
## circuitRefsochi             7028.03    7350.42    7533.5    7714.10  8.051e+03
## circuitRefspa               -308.47     -34.98     112.0     258.55  5.283e+02
## circuitRefsuzuka              99.55     377.28     521.8     666.15  9.521e+02
## circuitRefvalencia         -1838.43   -1442.70   -1234.4   -1020.44 -6.065e+02
## circuitRefvilleneuve         147.72     437.07     589.7     739.34  1.027e+03
## circuitRefyas_marina       -1442.82   -1163.42   -1016.8    -866.05 -5.801e+02
## circuitRefyeongam          -1489.53   -1105.32    -904.2    -707.47 -3.316e+02
## circuitRefzandvoort        -4327.40   -3922.26   -3703.5   -3481.57 -3.062e+03
## sigma2                   7850005.14 7998443.86 8080393.9 8164539.11  8.324e+06
#Checking for autocorrelation: 
acf(bayes.model[,1])

acf(bayes.model[,2])

acf(bayes.model[,20])

As we can see, there is almost no autocorrelation, so the model can be used.
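Beyond eyeballing acf(), the coda package (loaded alongside MCMCpack) can quantify mixing through the effective sample size: values close to the chain length indicate little autocorrelation. The sketch below demonstrates it on a toy autocorrelated chain; on the real fit the call would simply be effectiveSize(bayes.model).

```r
library(coda)  # provides effectiveSize() for mcmc objects

# Toy AR(1) chain with autocorrelation 0.5
set.seed(7)
x <- as.numeric(arima.sim(list(ar = 0.5), n = 5000))
effectiveSize(mcmc(x))  # noticeably smaller than the 5000 raw draws
```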

Predictive Distribution

We are going to compute the predictive distribution for circuit “Americas”

avg = bayes.model[,1] + bayes.model[,2] # column 1 is the intercept (baseline circuit albert_park); column 2 is the americas offset. 

america_circuit = rnorm(length(bayes.model[,1]), mean = mean(avg), sd = sd(avg)) # Predictive Distribution

hist(america_circuit, col = "darkgreen", main = "Predictive distribution for America Circuit", xlab = "Time (ms)")

mean(america_circuit)
## [1] 25089.89
# This is a predictive interval: 
quantile(america_circuit, c(0.025, 0.975)) # It is going to be much wider than the average.
##     2.5%    97.5% 
## 24787.26 25390.28

This is the predicted distribution, generating the samples from a normal distribution using the MCMCregress output.
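One caveat worth noting: the draws above summarize uncertainty about the circuit mean only. A full posterior predictive for a single pit stop would also add observation noise from the sigma2 column of the chain, producing a much wider interval. A sketch on toy posterior draws (the numbers are illustrative, roughly on the scale of this model):

```r
# Posterior of the mean vs. posterior predictive for one new observation
set.seed(3)
mu_draws     <- rnorm(10000, mean = 25000, sd = 150)  # draws of the circuit mean
sigma2_draws <- 8e6 + rnorm(10000, sd = 1e5)          # draws of sigma^2
y_new <- rnorm(10000, mean = mu_draws, sd = sqrt(sigma2_draws))
quantile(mu_draws, c(0.025, 0.975))  # narrow: uncertainty about the mean
quantile(y_new,    c(0.025, 0.975))  # wide: one future pit stop
```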

We are going to compute the predictive distribution for circuit “Imola”

avg = bayes.model[,1] + bayes.model[,9] # column 1 is the intercept (albert_park baseline); column 9 is the imola offset. 

circuit = rnorm(length(bayes.model[,1]), mean = mean(avg), sd = sd(avg)) # Predictive Distribution

# This is a predictive interval: 
quantile(circuit, c(0.025, 0.975)) # It is going to be much wider than the average.
##     2.5%    97.5% 
## 30548.63 31811.03
mean(circuit)
## [1] 31177.83
hist(circuit, col = "darkgreen", main = "Predictive distribution for Imola Circuit", xlab = "Time (ms)")

I have chosen this circuit because its beta coefficient is quite large, meaning its pit-stop times differ clearly from the baseline.

Now let's get predicted values, sampling from a Normal as before.

bayes.model <- MCMCregress(milliseconds ~ circuitRef, data=data, burnin = 1000, mcmc = 12000, thin = 10)

I'm introducing some thinning in the model just in case there is some autocorrelation in some of the predictors that we did not notice.

Now we compute a prediction of the milliseconds for each observation in the data, following the Bayesian model:

# Get unique circuit references, sorted to match the coefficient order in bayes.model
circuit = sort(unique(data$circuitRef))

# Initialize vector to store predictions
bayesian_means <- rep(NA, nrow(data))

# Make a prediction for each observation
for (i in 1:nrow(data)) {
  index <- which(circuit == data$circuitRef[i])
  avg = bayes.model[,index]
  if(index>1){
    avg = avg + bayes.model[,1] # add the intercept B0 to the circuit coefficient
  }
  
  bayesian_means[i] = rnorm(1, mean = mean(avg), sd = sd(avg)) # Predictive Distribution
}

# Combine results into a data frame
results_bayes <- data.frame(circuitRef = data$circuitRef, milliseconds = data$milliseconds,bayesian_time = bayesian_means)
#results$milliseconds = exp(results_bayes$milliseconds)

#MSE = (1/n) * sum((y_true - y_pred)^2)
#R-squared = 1 - (sum((y_true - y_pred)^2) / sum((y_true - mean(y_true))^2))

MSE = (1 / length(results) *  sum(((results_bayes$milliseconds - results_bayes$bayesian_time)/mean(results_bayes$milliseconds))^2)); MSE
## [1] 42.47066
R_squared = cor(results_bayes$milliseconds, results_bayes$bayesian_time)^2; R_squared # The correlation squared. 
## [1] 0.4655851
# It measures how well the model explains the variability of the milliseconds. 

# R_squared = 1 - (sum((results_bayes$milliseconds - results_bayes$bayesian_time)^2) / sum((results_bayes$milliseconds - mean(results_bayes$bayesian_time))^2)); R_squared

hist(results_bayes$milliseconds, col = "darkgreen", main = "Real Distribution")

hist(results_bayes$bayesian_time, col = "darkgreen", main = "Predicted Distribution")

ggplot(results, aes(x = milliseconds, y = ..density..)) +
  geom_histogram(fill = "darkgreen", bins = 40) +
  geom_density(aes(x = frequentist_time), color = "black", size = 1) +
  labs(x = "Milliseconds", y = "Density", title = "Histogram with Previous Frequentist Prediction") +
  theme_bw()

This repeats the earlier frequentist plot so that we can compare the models.

ggplot(results_bayes, aes(x = milliseconds, y = ..density..)) +
  geom_histogram(fill = "darkgreen", bins = 40,  boundary = 100) +
  geom_density(aes(x = bayesian_time), color = "black", size = 1) +
  labs(x = "Milliseconds", y = "Density", title = "Histogram with Bayesian Prediction") +
  theme_bw()

These are the Bayesian results. They are quite similar to the previous approach and to the benchmark. We are using a normal distribution to predict data that does not really follow a normal, so our results are not very accurate, as expected.

The good thing about this last prediction is that the model captures well the two main peaks of the data that we saw in the original plots.

Bayesian Approach Using a Gamma Distribution:

We are going to use Markov chain simulation, defining a non-informative prior, to compute the posterior and predictive distributions:

#Assume a non-informative prior, as we don't have prior knowledge for a circuit: 
nu.min=0; nu.max=100; a=0.01; b=0.01;
n=length(milliseconds)

# Set the number of burnin iterations: 
burnin=1000; iters=40000
T=burnin+iters

# Initializing the vectors for nu and lambda: 
nu=rep(0,T); lam=rep(0,T)

# initial values
nu[1]=10; lam[1]=5;
pac=0;

# Applying Markov chain simulation: 

for (t in 2:T) {
  lam[t]=rgamma(1,shape=a+n*nu[t-1],rate=b+sum(milliseconds))
  nuc=rnorm(1,nu[t-1],sd=0.1)
  if(nuc<nu.min || nuc>nu.max){nu[t]=nu[t-1]}
  else{
    logal=(nuc-1)*sum(log(milliseconds))-n*lgamma(nuc)+n*nuc*log(lam[t])
    logal=logal-(nu[t-1]-1)*sum(log(milliseconds))+n*lgamma(nu[t-1])
    logal=logal-n*nu[t-1]*log(lam[t])
    u=runif(1)
    if (log(u)<logal){
      nu[t]=nuc; if (t>burnin){pac=pac+1}
    }
    else nu[t]=nu[t-1]
  }
}

We can apply thinning to reduce the autocorrelation. Let's keep only every 40th observation:

nu.post=nu[seq(burnin+1,iters,by=40)]
lam.post=lam[seq(burnin+1,iters,by=40)]

acf(nu.post)

acf(lam.post)

As we still have a lot of autocorrelation, let's try to reduce it further; now we keep only one out of every 100 observations:

nu.post=nu[seq(burnin+1,iters,by=100)]
lam.post=lam[seq(burnin+1,iters,by=100)]
ts.plot(nu.post)

ts.plot(lam.post)

acf(nu.post)

acf(lam.post)

# Proportion of accepted values
pac=pac/iters
pac
## [1] 0.586575

We have reduced the autocorrelation a lot by keeping only 1 in 100 observations, which is good for the model.

Let's see the posterior distributions:

hist(nu.post, col = "darkgreen")

hist(lam.post, col = "darkgreen")

We can obtain 95% intervals for nu and lambda:

quantile(nu.post,c(0.025,0.975))
##     2.5%    97.5% 
## 39.45077 42.03904
quantile(lam.post,c(0.025,0.975))
##        2.5%       97.5% 
## 0.001633564 0.001741510
mean(nu.post); mean(lam.post)
## [1] 40.74967
## [1] 0.001688082

#Prediction probabilities:

M=100000
time.pred=rgamma(M,shape=nu.post,rate=lam.post)
hist(time.pred, col = "darkgreen", main = "Predicted time following a Gamma", xlab = "time (ms)")

Let's compute the probability of a pit stop lasting longer than 25 seconds:

mean(time.pred>25000)
## [1] 0.38805
# Interval for the predicted time of a pit stop: 
quantile(time.pred,c(0.025,0.975))
##     2.5%    97.5% 
## 17293.14 32108.35

We can see that the interval is quite wide: it goes from about 17 to 32 seconds.
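As a cross-check, with point estimates for the shape and rate the tail probability has a closed form via pgamma, so no simulation is needed. The values below are the posterior means reported above.

```r
# Closed-form tail probability for a Gamma(shape, rate) at 25 seconds
nu_hat  <- 40.75      # posterior mean of nu (shape)
lam_hat <- 0.001688   # posterior mean of lambda (rate)
p_over_25s <- 1 - pgamma(25000, shape = nu_hat, rate = lam_hat)
p_over_25s  # close to the Monte Carlo estimate of ~0.39
```

The small remaining difference comes from the Monte Carlo version integrating over the posterior of nu and lambda rather than plugging in their means.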

hist(milliseconds,freq=F, col = "darkgreen", main = "Bayesian Prediction following a Gamma", xlab = "time (ms)" )
lines(density(time.pred))

To finish, we show in green the original distribution of the milliseconds and, in black, the fitted density from the Gamma model. These results seem better to me than the Gaussian models we saw before, as the line fits the data well.

The problem with our data is that we are given only the circuit where the race takes place and the time each car spends stopped changing its wheels. We do not have much information about the car's status or any other condition that would help determine how long the car will be stopped.

Sometimes this happens with data and we have to work with it. We predict the time of a pit stop from the circuit, not from the conditions of the car or the race, so the results we get are quite similar to what we could compute by hand by averaging the observations we hold for each circuit.